A Relationship with Evolution Strategies (ES). In the main paper, we restrict the gradient to the random base

Neural Information Processing Systems

Formally, this constraint also applies to special cases of Natural Evolution Strategies [37, 3]. Similar estimators can be obtained for other symmetric distributions with finite second moment. Moreover, the additional hyperparameter σ that determines the magnitude of the perturbation needs to be carefully chosen [33]. Figure B.7: Validation accuracy after 100 epochs and mean gradient correlation with SGD plotted against increasing subspace dimensionality d on the CIFAR-10 CNN (average of three runs). As expected, the mean cosine similarity across 100 pairs of random vectors decreases with growing dimensionality.
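As an illustration of the ES estimator discussed above, here is a minimal antithetic sketch (function and parameter names are our own, not from the paper); note how the perturbation scale `sigma` appears explicitly as the hyperparameter that must be tuned:

```python
import numpy as np

def es_gradient(loss, theta, sigma=0.1, n_pairs=50, rng=None):
    """Antithetic ES gradient estimate: perturb theta along Gaussian
    directions eps of magnitude sigma and average the centred loss
    differences along each direction."""
    rng = np.random.default_rng(rng)
    grad = np.zeros_like(theta)
    for _ in range(n_pairs):
        eps = rng.standard_normal(theta.shape)
        grad += (loss(theta + sigma * eps) - loss(theta - sigma * eps)) / (2 * sigma) * eps
    return grad / n_pairs

# On a quadratic loss the estimate should align with the true gradient 2*theta.
theta = np.array([1.0, -2.0, 3.0])
g_est = es_gradient(lambda t: float(np.sum(t**2)), theta, sigma=0.01, n_pairs=500, rng=0)
```

With a small `sigma` and enough antithetic pairs, the estimate points close to the true gradient direction, consistent with the cosine-similarity comparison against SGD above.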


An Improved Empirical Fisher Approximation for Natural Gradient Descent

Neural Information Processing Systems

Approximate Natural Gradient Descent (NGD) methods are an important family of optimisers for deep learning models, which use approximate Fisher information matrices to pre-condition gradients during training. The empirical Fisher (EF) method approximates the Fisher information matrix empirically by reusing the per-sample gradients collected during back-propagation. Despite its ease of implementation, the EF approximation has theoretical and practical limitations. This paper investigates the inversely-scaled projection issue of EF, which is shown to be a major cause of its poor empirical approximation quality. An improved empirical Fisher (iEF) method is proposed to address this issue, which is motivated as a generalised NGD method from a loss reduction perspective, while retaining the practical convenience of EF.
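To make the EF construction concrete, a rough sketch of a generic damped EF-preconditioned step (this is the plain EF baseline described above, not the paper's iEF; the function and parameter names are illustrative):

```python
import numpy as np

def ef_preconditioned_step(per_sample_grads, damping=1e-3):
    """Build the empirical Fisher F = G^T G / N from the per-sample
    gradient matrix G (N x P) reused from back-propagation, then
    precondition the mean gradient by solving (F + damping*I) d = g."""
    G = np.asarray(per_sample_grads, dtype=float)
    n, p = G.shape
    g_mean = G.mean(axis=0)
    F = G.T @ G / n + damping * np.eye(p)
    return np.linalg.solve(F, g_mean)

# Two orthogonal unit per-sample gradients: F is roughly (1/2)*I,
# so the preconditioned step roughly doubles the mean gradient.
step = ef_preconditioned_step([[1.0, 0.0], [0.0, 1.0]])
```

The appeal of EF is visible here: only the per-sample gradients `G` are needed, with no extra backward passes.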


Wasserstein Regression as a Variational Approximation of Probabilistic Trajectories through the Bernstein Basis

Maslov, Maksim, Kugaevskikh, Alexander, Ivanov, Matthew

arXiv.org Artificial Intelligence

This paper considers the problem of regression over distributions, which is becoming increasingly important in machine learning. Existing approaches often ignore the geometry of the probability space or are computationally expensive. To overcome these limitations, a new method is proposed that combines the parameterization of probability trajectories using a Bernstein basis and the minimization of the Wasserstein distance between distributions. The key idea is to model a conditional distribution as a smooth probability trajectory defined by a weighted sum of Gaussian components whose parameters -- the mean and covariance -- are functions of the input variable constructed using Bernstein polynomials. The loss function is the averaged squared Wasserstein distance between the predicted Gaussian distributions and the empirical data, which takes into account the geometry of the distributions. An autodiff-based optimization method is used to train the model. Experiments on synthetic datasets that include complex trajectories demonstrated that the proposed method provides competitive approximation quality in terms of the Wasserstein distance, Energy Distance, and RMSE metrics, especially in cases of pronounced nonlinearity. The model demonstrates trajectory smoothness that is better than or comparable to alternatives and robustness to changes in data structure, while maintaining high interpretability due to explicit parameterization via control points. The developed approach represents a balanced solution that combines geometric accuracy, computational practicality, and interpretability. Prospects for further research include extending the method to non-Gaussian distributions, applying entropy regularization to speed up computations, and adapting the approach to working with high-dimensional data for approximating surfaces and more complex structures.
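The parameterization described above can be sketched as follows (the restriction to 1-D Gaussians and the control-point names are our simplifications; the paper works with full mean/covariance trajectories):

```python
import numpy as np
from math import comb

def bernstein(i, n, x):
    """Bernstein basis polynomial B_{i,n}(x) on [0, 1]."""
    return comb(n, i) * x**i * (1 - x)**(n - i)

def gaussian_trajectory(ctrl_means, ctrl_stds, x):
    """Mean and std of the predicted Gaussian at input x in [0, 1],
    each a Bernstein-weighted sum of control points."""
    n = len(ctrl_means) - 1
    basis = np.array([bernstein(i, n, x) for i in range(n + 1)])
    return basis @ ctrl_means, basis @ ctrl_stds

def w2_gaussian_1d(m1, s1, m2, s2):
    """Closed-form 2-Wasserstein distance between 1-D Gaussians."""
    return float(np.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2))

# The trajectory interpolates the endpoint control points at x = 0 and x = 1,
# which is what makes the parameterization interpretable.
means = np.array([0.0, 1.0, 2.0])
stds = np.array([1.0, 1.0, 2.0])
m_start, s_start = gaussian_trajectory(means, stds, 0.0)
m_end, s_end = gaussian_trajectory(means, stds, 1.0)
```

The closed-form W2 distance between Gaussians is what keeps the loss cheap to evaluate and differentiable for autodiff-based training.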



An in-depth look at approximation via deep and narrow neural networks

Dommel, Joris, Wegner, Sven A.

arXiv.org Artificial Intelligence

In 2017, Hanin and Sellke showed that the class of arbitrarily deep, real-valued, feed-forward and ReLU-activated networks of width w forms a dense subset of the space of continuous functions on R^n, with respect to the topology of uniform convergence on compact sets, if and only if w>n holds. To show the necessity, a concrete counterexample function f:R^n->R was used. In this note we actually approximate this very f by neural networks in the two cases w=n and w=n+1 around the aforementioned threshold. We study how the approximation quality behaves as we vary the depth, and what effect (spoiler alert: dying neurons) causes that behavior.
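A toy experiment in this spirit (randomly initialised weights and our own naming; a crude proxy for the dying-ReLU effect, not the paper's trained construction):

```python
import numpy as np

def dead_unit_fraction(x, width, depth, rng=None):
    """Propagate a batch x (shape: batch x n) through a deep, narrow,
    randomly-initialised ReLU network of the given width, and return the
    fraction of hidden units that output zero on every input in the batch
    -- a rough proxy for 'dead' neurons."""
    rng = np.random.default_rng(rng)
    h = x
    dead, total = 0, 0
    for _ in range(depth):
        W = rng.standard_normal((h.shape[1], width)) / np.sqrt(h.shape[1])
        h = np.maximum(0.0, h @ W)  # ReLU; units stuck at 0 stay at 0
        dead += int(np.sum(np.all(h == 0.0, axis=0)))
        total += width
    return dead / total

# Narrow (width barely above n) and very deep: dead units accumulate.
x = np.random.default_rng(0).standard_normal((64, 3))
frac = dead_unit_fraction(x, width=4, depth=30, rng=0)
```

At such narrow widths, once every unit in a layer dies the whole downstream network is constant, which is one intuition for why depth stops helping below the w>n threshold.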



Checklist

Neural Information Processing Systems

The checklist follows the references. For example: Did you include the license to the code and datasets? Please do not modify the questions and only use the provided macros for your answers.


d1ff1ec86b62cd5f3903ff19c3a326b2-AuthorFeedback.pdf

Neural Information Processing Systems

We would like to thank the reviewers for their comments, and take the opportunity to answer their questions below. We thank the reviewer for the relevant [Amari et al., 2000] reference, which we will cite and discuss. Similarly, [Amari et al., 2000] considers single-layer networks. Further, we examined the method's accuracy relative to recent techniques, and extended it to ... We are open to changing the term "WoodFisher", which we used as a mnemonic. Please see Appendix S5 for ablation studies. For simplicity, we consider the scaling constant as 1 here. Thanks for the suggestions; we will correct the font sizes and the broken references.

